Empiric Introduction to Light Stochastic Binarization

Author

  • Daniel Devatman Hromada
Abstract

We introduce a novel method for transforming texts into short binary vectors that can subsequently be compared by Hamming distance. Similarly to other semantic hashing approaches, the objective is to perform radical dimensionality reduction by putting texts with similar meaning into the same or similar buckets while putting texts with dissimilar meaning into different and distant buckets. First, the method transforms the texts into a complete TF-IDF representation, then applies Reflective Random Indexing in order to fold both term and document spaces into a low-dimensional space. Subsequently, every dimension of the resulting low-dimensional space is simply thresholded at its 50th percentile, so that every individual bit of the resulting hash cuts the whole input dataset into two subsets of equal cardinality. Without any parameter-tuning training phase whatsoever, the method attains, especially in the high-precision/low-recall region of the 20newsgroups text classification task, results comparable to those obtained by much more complex deep learning techniques.
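
The pipeline described in the abstract (TF-IDF weighting, iterative random projection, median thresholding, Hamming comparison) can be illustrated with a minimal NumPy sketch. This is not the author's implementation: the function names, the dense ternary initialization of term vectors, the number of reflective iterations, and the row normalization are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def tfidf(counts):
    """Turn a (n_docs, n_terms) matrix of raw term counts into TF-IDF weights."""
    n_docs = counts.shape[0]
    df = np.count_nonzero(counts, axis=0)        # document frequency per term
    idf = np.log(n_docs / np.maximum(df, 1))     # guard against unseen terms
    return counts * idf

def normalize_rows(m):
    """Scale each row to unit length, leaving all-zero rows as zeros."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m / np.maximum(norms, 1e-12)

def lsb_hash(counts, n_bits=16, n_iter=2, seed=0):
    """Hypothetical sketch: one boolean code of n_bits per document."""
    rng = np.random.default_rng(seed)
    weights = tfidf(counts.astype(float))
    # Assumption: term vectors start as dense random ternary vectors.
    terms = rng.choice([-1.0, 0.0, 1.0], size=(counts.shape[1], n_bits))
    # Reflective step: documents are rebuilt from terms, then terms from
    # documents, so both spaces are folded into the same n_bits dimensions.
    for _ in range(n_iter):
        docs = normalize_rows(weights @ terms)
        terms = normalize_rows(weights.T @ docs)
    docs = normalize_rows(weights @ terms)
    # Threshold every dimension at its median (the 50th percentile).
    return docs > np.median(docs, axis=0)

def hamming(a, b):
    """Number of bit positions in which two codes differ."""
    return int(np.count_nonzero(a != b))

# Toy usage on synthetic counts; real input would be a document-term matrix.
toy_counts = np.random.default_rng(1).poisson(0.5, size=(40, 200))
toy_codes = lsb_hash(toy_counts, n_bits=16)
toy_distance = hamming(toy_codes[0], toy_codes[1])
```

Because each threshold is the median of its own dimension, every bit is set for at most half of the documents, and for exactly half when the number of documents is even and no values tie. That is the equal-cardinality split the abstract refers to.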


Similar resources

Better Synchronous Binarization for Machine Translation

Binarization of Synchronous Context Free Grammars (SCFG) is essential for achieving polynomial time complexity of decoding for SCFG parsing based machine translation systems. In this paper, we first investigate the excess edge competition issue caused by a left-heavy binary SCFG derived with the method of Zhang et al. (2006). Then we propose a new binarization method to mitigate the problem by e...


LCFRS binarization and debinarization for directional parsing

In data-driven parsing with Linear Context-Free Rewriting System (LCFRS), markovized grammars are obtained through the annotation of binarization non-terminals during grammar binarization, as in the corresponding work on PCFG parsing. Since there is indication that directional parsing with a non-binary LCFRS can be faster than parsing with a binary LCFRS, we present a debinarization procedure w...


Combining PCFG-LA Models with Dual Decomposition: A Case Study with Function Labels and Binarization

It has recently been shown that different NLP models can be effectively combined using dual decomposition. In this paper we demonstrate that PCFG-LA parsing models are suitable for combination in this way. We experiment with the different models which result from alternative methods of extracting a grammar from a treebank (retaining or discarding function labels, left binarization versus right ...


Extreme Value Theory Based Text Binarization In Documents and Natural Scenes

This paper presents a novel image binarization method that can deal with degradations such as shadows, nonuniform illumination, low-contrast, large signal-dependent noise, smear and strain. A pre-processing procedure based on morphological operations is first applied to suppress light/dark structures connected to image border. A novel binarization concept based on difference of gamma functions ...


Using Physically Based Rendering to Benchmark Structured Light Scanners: Appendix

Although the simulator generates realistic images, we need to verify that the range scans we produce using the synthetic scanner contain artifacts similar to those acquired in real scans. A common operation performed when using binary coded patterns is the binarization process. This identifies each pixel as either lit or unlit. Our validation procedure, which is based on this binarization opera...



Publication date: 2014